hard example
CS-Isolate: Extracting Hard Confident Examples by Content and Style Isolation
Label noise widely exists in large-scale image datasets. To mitigate the side effects of label noise, state-of-the-art methods focus on selecting confident examples by leveraging semi-supervised learning. Existing research shows that the ability to extract hard confident examples, which are close to the decision boundary, significantly influences the generalization ability of the learned classifier. In this paper, we find that a key reason some hard examples are close to the decision boundary is the entanglement of style factors with content factors. The hard examples become more discriminative when we focus solely on content factors, such as semantic information, while ignoring style factors. Nonetheless, given only noisy data, content factors are not directly observed and have to be inferred. To infer content factors for classification when learning with noisy labels, our objective is to ensure that the content factors of all examples in the same underlying clean class remain unchanged as their style information changes. To achieve this, we utilize different data augmentation techniques to alter the styles while regularizing content factors based on some confident examples. By training existing methods with our inferred content factors, we demonstrate the effectiveness of CS-Isolate in learning hard examples on benchmark datasets. The implementation is available at https://github.com/tmllab/2023
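The content-invariance objective the abstract describes can be sketched as a consistency penalty between two style-altering augmentations of the same image (a toy illustration under our own naming, not the authors' implementation):

```python
# Minimal sketch of the content-invariance idea: content factors inferred
# from two differently augmented views of the same image should match.
# In a real pipeline, content_a and content_b would come from an encoder
# applied to augmented views; here they are plain lists of floats.
def content_consistency_loss(content_a, content_b):
    # Squared L2 distance between the two sets of inferred content
    # factors; minimizing it pushes content to be invariant to style.
    return sum((a - b) ** 2 for a, b in zip(content_a, content_b))
```

In training, this term would be added to the classification loss so that changing style (via augmentation) cannot move an example's content representation.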
Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets
Pikus, Benjamin, Tiwari, Pratyush Ranjan, Ye, Burton
Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate whether example difficulty affects GRPO training effectiveness by comparing selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Training on the hardest 10% of examples (those where the base model fails most often) yields dramatic performance gains of up to 47%, while easy examples produce minimal improvements of 3-15%. This occurs because GRPO requires outcome variance to generate learning signals; hard examples maintain mixed success/failure outcomes throughout training, while easy examples quickly converge to consistent success, eliminating learning opportunities. Moreover, models trained on hard examples show superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on the AIME2025 benchmark. Our findings provide clear guidance: when budget-constrained, prioritize collecting and annotating examples where your base model struggles, as these drive nearly all learning value in GRPO fine-tuning.
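The variance mechanism the abstract describes can be sketched in a few lines (a toy illustration with our own function names, not the authors' training code): GRPO normalizes rewards within a group of sampled completions, so a group with identical outcomes yields zero advantage for every rollout.

```python
# Sketch of group-relative advantages as used in GRPO-style training:
#   A_i = (r_i - mean(r)) / std(r)
# If every rollout in the group succeeds (or fails), all advantages are
# zero and the example contributes no gradient signal.
def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

easy = group_advantages([1, 1, 1, 1])  # converged easy example -> all zeros
hard = group_advantages([1, 0, 1, 0])  # mixed outcomes -> nonzero signal
```

This is why, once an easy example is solved consistently, it stops contributing to learning, while a hard example with mixed outcomes keeps producing signal.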
A Appendix
Hyperparameters:
- Number of encoder (decoder) layers: 6
- Number of layers in the feed-forward network: 2
- Number of hidden units in the feed-forward network: 128
- Mask filter size: 3
- Mask number of filters: 16
- Ratio of residual connection: 1.5
- Dropout rate: 0.1
- Optimizer: Adam
- Warm-up steps: 4000
- Learning rate: d^(-0.5) * min(t^(-0.5), t * 4000^(-1.5)), where d is the hidden dimension and t is the training step

Unless otherwise specified, the task performed in this section is selection sort (Section 4). Figure 6 shows the sorting performance of the transformers w/o mask supervision. Figure 7 shows sorting performances with different encoding schemes. In Figure 9, we show the strong generalization performance of the different architectures. While some changes are able to improve performance in this regime, the performance ultimately drops steeply as the length of the test sequence increases. The symbol e represents the end token.
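The warm-up schedule in the table is the standard Transformer ("Noam") schedule; a minimal sketch, assuming d = 128 hidden units and the 4000 warm-up steps listed above:

```python
# Noam learning-rate schedule: linear warm-up for the first `warmup`
# steps, then inverse-square-root decay.
def noam_lr(step, d_model=128, warmup=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches meet exactly at step = warmup, where the schedule peaks.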
MarginSel : Max-Margin Demonstration Selection for LLMs
Ambati, Rajeev Bhatt, Lester, James, Srivastava, Shashank, Chaturvedi, Snigdha
Large Language Models (LLMs) excel at few-shot learning via in-context learning (ICL). However, the effectiveness of ICL is often sensitive to the selection and ordering of demonstration examples. To address this, we present MarginSel: Max-Margin Demonstration Selection for LLMs, a two-step method that selects hard demonstration examples for the ICL prompt, adapting to each test instance. Our approach achieves 2-7% absolute improvement in F1-score across classification tasks, compared to a random selection of examples. We also provide theoretical insights and empirical evidence showing that MarginSel induces max-margin behavior in LLMs by effectively increasing the margin for hard examples, analogous to support vectors, thereby shifting the decision boundary in a beneficial direction.
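As an illustration of margin-based hardness (a sketch under our own naming, not the MarginSel implementation), demonstrations can be ranked by the gap between the top two class probabilities from an initial model pass, with the smallest-margin examples treated as hard:

```python
# Margin of a probability distribution: gap between the two most
# probable classes. Small margin = close to the decision boundary.
def margin(probs):
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def select_hard(examples, k):
    # examples: list of (id, class-probability list); keep the k
    # examples with the smallest margins as "hard" demonstrations.
    return sorted(examples, key=lambda e: margin(e[1]))[:k]
```

In the paper's analogy, such small-margin examples play the role of support vectors for the ICL prompt.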
- North America > United States > Washington > King County > Seattle (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.34)
Review for NeurIPS paper: SuperLoss: A Generic Loss for Robust Curriculum Learning
Additional Feedback: Further comments: - The definition of hard and easy examples is limited to their respective confidence scores or losses. Although previous work uses similar definitions, confidence or loss is not always a good indicator of the true easiness or hardness of a sample; for example, these signals can be erroneous at early iterations. The paper lacks an experiment that validates this definition. Examples the model treats as easy are probably, in part, hard or noisy examples that were mistaken for easy ones; likewise, examples the model treats as hard are probably a mixture of easy, hard, and noisy examples with low confidence across the loss spectrum.
What can Large Language Models Capture about Code Functional Equivalence?
Maveli, Nickil, Vergari, Antonio, Cohen, Shay B.
Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using them to generate or classify code fragments. At the same time, whether they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code-)LLMs, to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.
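To make the notion of semantics-preserving versus semantics-altering transformations concrete, here is an illustrative pair in the spirit of SeqCoBench's setup (our own example, not one of the benchmark's actual transformations):

```python
# A reference Python snippet and two transformed variants.
def original(xs):
    total = 0
    for x in xs:
        total += x
    return total

def renamed(values):
    # Semantics-preserving transformation: variable renaming only.
    acc = 0
    for v in values:
        acc += v
    return acc

def altered(xs):
    # Semantics-altering transformation: a single operator change.
    total = 0
    for x in xs:
        total -= x
    return total
```

A model that truly captures functional equivalence should pair `original` with `renamed` but not with `altered`, even though all three are near-identical as token sequences.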
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Singapore (0.04)
- (7 more...)
Maximizing V-information for Pre-training Superior Foundation Models
Yang, Wenxuan, Tan, Weimin, Zhang, Hanyu, Yan, Bo
Pre-training foundation models on large-scale datasets demonstrates exceptional performance. However, recent research questions this traditional notion, exploring whether an increase in pre-training data always leads to enhanced model performance. To address this issue, data-effective learning approaches have been introduced, but current methods in this area lack a clear standard for sample selection. Our experiments reveal that by maximizing V-information, sample selection can be framed as an optimization problem, enabling effective improvement in model performance even with fewer samples. Under this guidance, we develop an optimal data-effective learning method (OptiDEL) to maximize V-information. The OptiDEL method generates hard samples to achieve or even exceed the performance of models trained on the full dataset while using substantially less data. Comparing OptiDEL with state-of-the-art approaches, we find that it consistently outperforms existing methods across different datasets, with foundation models trained on only 5% of the pre-training data surpassing those trained on the full dataset.
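One way to make a V-information criterion concrete is via pointwise V-information (PVI), following Ethayarajh et al.'s V-usable information; treating low-PVI samples as "hard" is our assumption about how such a selector could work, not OptiDEL's actual rule:

```python
import math

# Pointwise V-information: how much easier the label becomes to predict
# once the input is observed, relative to an input-free baseline.
def pvi(log_p_null, log_p_model):
    # log_p_null: log-prob of the label under a model given no input
    # log_p_model: log-prob of the label given the input
    return log_p_model - log_p_null

def select_hard(samples, k):
    # samples: list of (id, log_p_null, log_p_model).
    # Lowest PVI = input helps least = hardest sample.
    return sorted(samples, key=lambda s: pvi(s[1], s[2]))[:k]
```

Negative PVI means the input actively misleads the model about the label, which is exactly the regime where hard-sample selection pays off.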
- Europe > Austria (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training
Verma, Bhuvanesh, Raithel, Lisa
Natural language processing (NLP) has seen significant advancements, beginning with the introduction of word embeddings (Mikolov et al., 2013), followed by transformer architectures like BERT (Vaswani et al., 2017; Devlin et al., 2019), and specialized language models (LMs) such as BioBERT (Lee et al., 2020) and PubMedBERT (Gu et al., 2021) tailored for the biomedical domain. The advent of large language models (LLMs) like GPT-3 (Brown et al., 2020), commonly known as ChatGPT, has further pushed the boundaries of NLP, showcasing capabilities in diverse NLP tasks and even reasoning.

Building on the methodology outlined by Kanakarajan and Sankarasubbu (2023), we assessed the zero-shot performance of various instruction-tuned LLMs to identify the most effective model. Upon selecting the best LLM, we introduced an auxiliary module during the fine-tuning process, which emphasized learning "hard" examples. Taking inspiration from Korakakis and Vlachos (2023), who experimented with various configurations for the auxiliary module and highlighted its substantial impact on the final NLI system's performance, we explored various architectures for the auxiliary module.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.04)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- (4 more...)
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models
Li, Wei, Ma, Ren, Wu, Jiang, Gu, Chenya, Peng, Jiahui, Len, Jinyang, Zhang, Songyang, Yan, Hang, Lin, Dahua, He, Conghui
In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.
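A CircularEval-style check can be sketched as follows (our reading of the protocol: rotate the answer options and count a question correct only if the model answers correctly under every rotation; function names are ours):

```python
# Circular evaluation of a multiple-choice question: the model must
# locate the correct option under every rotation of the option list,
# which neutralizes positional bias (e.g. always answering "A").
def circular_eval(options, answer_idx, model):
    # options: list of option strings; answer_idx: index of the correct
    # option; model: callable mapping an option list to a chosen index.
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        correct_pos = (answer_idx - shift) % n
        if model(rotated) != correct_pos:
            return False
    return True
```

A position-biased model that always picks the first option passes at most one rotation, so its accuracy collapses under this protocol while a genuinely knowledgeable model is unaffected.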